A question about using two fixed effects that ``overlap'' with each other

Sam Hwang

Join Date: Feb 2020

Posts: 9
#1

10 Aug 2023, 18:07

Dear Statalist users,

We are trying to estimate the following regression model, where the unit of observation is a child:

where I{ } is an indicator function that is equal to 1 if the statement inside the curly brackets is true, and 0 otherwise.
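One way to write a model matching this description is sketched below. The notation is assumed here for illustration and is not taken from the post: $y_i$ is the outcome of child $i$, $X_i$ a vector of covariates, $\theta_g$ the effect of grandparent couple $g$, and $P(i)$ and $M(i)$ the paternal and maternal grandparent couples of child $i$.

```latex
y_i = X_i'\beta + \sum_{g=1}^{G} \theta_g \, I\{\, g = P(i) \ \text{or} \ g = M(i) \,\} + \varepsilon_i
```

Under this form each child loads on exactly two of the $G$ couple effects, and a given couple carries the same $\theta_g$ whether it is on the paternal or the maternal side.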

As an example, suppose that we have four grandparent couples (denoted GP1, GP2, GP3, and GP4 in the figure below) and three grandchildren (denoted gc1, gc2, and gc3 in the figure below).

where paternal grandparents are connected to their grandchild with a solid line and maternal grandparents are connected to their grandchild with a dashed line.

Conceptually, there is nothing complicated about this model. Each observation (child) will have two dummy variables equal to 1 (one for each of his/her two grandparent couples), and all the other dummy variables equal to 0. Continuing with the example above, the table below shows which dummy variables would be equal to 1 for each grandchild (the unit of observation in the above model):
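For a toy case this small, the dummies can be built directly. The following is a minimal sketch, assuming hypothetical variables gpid1 and gpid2 that hold each grandchild's two grandparent-couple ids; the memberships are made up for illustration.

```stata
* Hypothetical toy data: one row per grandchild, two couple ids per row
clear
input int(gcid gpid1 gpid2)
1 1 2
2 2 3
3 2 4
end

* One indicator per couple: 1 if the couple appears on either side
forvalues g = 1/4 {
    generate byte gp`g' = (gpid1 == `g') | (gpid2 == `g')
}

* Sanity check: every grandchild has exactly two indicators switched on
egen byte n_gp = rowtotal(gp1-gp4)
assert n_gp == 2
list
```

The loop creates one column per couple, which is the step that stops scaling once the number of couples reaches the millions.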

However, we find it difficult to estimate this model with our sample. The reason is as follows: since there are millions of grandparent couples associated with our sample, it is not computationally feasible to have one dummy variable for each of these grandparent couples.

An alternative would be to assign each grandparent couple a numeric id, and create two variables: paternal grandparents' id and maternal grandparents' id, and use reghdfe and include these two variables as fixed effects.

However, we run into the following issue: the maternal grandparents of some children in our sample are the paternal grandparents of other children in our sample.

Continuing with the example above, we can assign to each grandchild the values of his/her paternal grandparents' id and maternal grandparents' id, as follows:

In this example, if we use the ids of paternal and maternal grandparents as separate fixed effects (using, e.g., reghdfe), we cannot let Stata know that the ``2'' in the paternal-grandparents fixed-effect variable refers to the same grandparent couple as the ``2'' in the maternal-grandparents fixed-effect variable.
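The issue can be seen on toy data by expanding each id variable into its own set of dummies, which is conceptually what happens when two variables are absorbed separately. This is only a sketch: gpid1 (paternal) and gpid2 (maternal) are hypothetical variable names, and the id values are made up.

```stata
* Hypothetical toy data: couple 2 is maternal for child 1, paternal for 2 and 3
clear
input int(gcid gpid1 gpid2)
1 1 2
2 2 3
3 2 4
end

* Each id variable gets its own, unrelated set of indicator columns
tabulate gpid1, generate(pat_)   // pat_1 (id 1), pat_2 (id 2)
tabulate gpid2, generate(mat_)   // mat_1 (id 2), mat_2 (id 3), mat_3 (id 4)

* Couple 2 now shows up twice, as pat_2 and as mat_1, with separate coefficients
list gcid gpid1 gpid2 pat_2 mat_1
```

Because pat_2 and mat_1 are distinct columns, a model using both sets would estimate two effects for couple 2 rather than one.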

We are wondering if anyone knows any Stata command that can handle this case. Any help would be greatly appreciated.

Thank you!

Sam

Daniel Schaefer

Join Date: Mar 2020
Posts: 822
#2

11 Aug 2023, 12:12

Hi Sam,

This is a tricky problem! I notice your second image looks a bit like data in wide format. I think you might want something like this:

Code:

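* Toy data: one row per grandchild, one 0/1 column per grandparent couple
clear
input int(gcid gp1 gp2 gp3 gp4) str20(var1) int(var2)
1 1 1 0 0 "preserve" 1
2 0 1 1 0 "this" 2
3 0 1 0 1 "data" 3
end

* Define the value label first, then attach it to var2
label define var2_label 1 "preserve" 2 "this" 3 "too"
label values var2 var2_label

* Wide to long: one row per (grandchild, grandparent couple) pair
reshape long gp, i(gcid) j(gpid)
rename gp is_grandparent
list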

Code:

. list

     +----------------------------------------------+
     | gcid   gpid   is_gra~t       var1       var2 |
     |----------------------------------------------|
  1. |    1      1          1   preserve   preserve |
  2. |    1      2          1   preserve   preserve |
  3. |    1      3          0   preserve   preserve |
  4. |    1      4          0   preserve   preserve |
  5. |    2      1          0       this       this |
     |----------------------------------------------|
  6. |    2      2          1       this       this |
  7. |    2      3          1       this       this |
  8. |    2      4          0       this       this |
  9. |    3      1          0       data        too |
 10. |    3      2          1       data        too |
     |----------------------------------------------|
 11. |    3      3          0       data        too |
 12. |    3      4          1       data        too |
     +----------------------------------------------+

And now it should be straightforward to treat this as a fixed effects model. I don't think you need reghdfe for this.

Sam Hwang

Join Date: Feb 2020

Posts: 9
#3

12 Aug 2023, 09:43

Hi Daniel,

Thank you so much for your reply.

If I may ask some follow-up questions: after ``reshape long'' has been run in your example, each grandchild has four lines of data (or as many lines as there are grandparent couples in the entire dataset). Then I guess we need to drop observations for which is_grandparent==0, and then we could run areg y X, absorb(gpid)? Am I understanding your suggestion correctly?

Another question is, and this may be an elementary question, but I am having a hard time seeing how this can be equivalent to the original model with one line of data for each grandchild with two grandparent-couple dummies. Some explanation would be greatly appreciated.

Thank you,

Sam

Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#4

12 Aug 2023, 11:40

Hi Sam,

What about a multilevel model here? It makes sense to me that you might drop every observation where is_grandparent==0, but I think you can equivalently just model this by including the is_grandparent variable as a predictor.

Code:

help mixed

After running the code in #2, you have a dataset with clusters. I would treat this as a random effects multilevel model, but in this case I think you might be able to use the regular regress command with the vce(cluster clustvar) option.

Code:

areg y X,absorb(gpid)

Hmm, I can see what you are going for here, but I don't think this correctly deals with the new data structure. Maybe you can correct your standard errors with vce(robust)?
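To make the syntax of these options concrete, here is a rough sketch on simulated data. Everything in it is an assumption for illustration: the variable names y, x, gpid1, and gpid2, the sample sizes, and the data-generating process, which contains no grandparent effect at all. It shows only how the commands are called on the stacked data, not whether either estimator recovers the original specification.

```stata
* Simulated, hypothetical data: 300 children, 40 grandparent couples
clear
set seed 12345
set obs 300
generate long gcid = _n
generate int gpid1 = ceil(40*runiform())   // paternal couple id
generate int gpid2 = ceil(40*runiform())   // maternal couple id
generate double x = rnormal()
generate double y = 1 + 0.5*x + rnormal()

* Stack to two rows per child, one per grandparent couple
reshape long gpid, i(gcid) j(side)

* Pooled regression with standard errors clustered on grandparent couple
regress y x, vce(cluster gpid)

* Couple fixed effects absorbed, with heteroskedasticity-robust standard errors
areg y x, absorb(gpid) vce(robust)
```

Note that each child contributes two rows after stacking, so the reported number of observations is twice the number of children.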

"I am having a hard time seeing how this can be equivalent to the original model with one line of data for each grandchild with two grandparent-couple dummies."

What I like about this strategy is that it maintains the full set of relationships, but doesn't treat maternal grandparents and paternal grandparents as having separate effects. As for whether it is equivalent: in terms of the amount of variance in your outcome that you account for, it should be. But it won't give you separate coefficients for each grandparent couple. That's what you want, right? To control for grandparent effects without estimating a separate regression coefficient for every one of your "millions of grandparent couples"?

Sam Hwang

Join Date: Feb 2020

Posts: 9
#5

15 Aug 2023, 13:30

Hi Daniel,

Thank you very much for your follow-up reply. It clarified what we need to do to carry out your suggestion. And you are right when you said,

"That's what you want, right? To control for grandparent effects without estimating a separate regression coefficient for every one of your "millions of grandparent couples"?"

One of your suggestions that we cannot implement is to keep observations with is_grandparent==0, as you suggested below:

"What about a multilevel model here? It makes sense to me that you might drop every observation where is_grandparent==0, but I think you can equivalently just model this by including the is_grandparent variable as a predictor."

One issue with this for our context is that we have millions of grandparents, which means that the new data structure would result in too large a dataset (since we have millions of children times millions of grandparents). I am not sure if Stata can handle such a large dataset.
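One possible way around the size concern, sketched under the assumption that each child's two couple ids are already stored in hypothetical variables gpid1 (paternal) and gpid2 (maternal): reshaping just those two columns gives exactly two rows per child, which is the same set of rows that would remain after dropping is_grandparent==0 from the full children-by-grandparents layout.

```stata
* Hypothetical toy data: one row per grandchild, two couple ids per row
clear
input int(gcid gpid1 gpid2)
1 1 2
2 2 3
3 2 4
end

* Stack the two id columns: two rows per child, never children x couples
reshape long gpid, i(gcid) j(side)
label define side_lbl 1 "paternal" 2 "maternal"
label values side side_lbl

* Every grandchild should now appear exactly twice
bysort gcid: assert _N == 2
list, sepby(gcid)
```

With N children the stacked dataset has 2N rows, so its size grows with the number of children only and not with the number of grandparent couples.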

We will experiment with your suggestions. Thank you!

Sam